Providing Internet Access to Portuguese Corpora: the AC/DC Project

Authors

  • Diana Santos
  • Eckhard Bick
Abstract

In this paper we report on the activity of the project Computational Processing of Portuguese (Processamento computacional do português) concerning the provision of access to Portuguese corpora through the Internet. One of its activities, the AC/DC project (Acesso a corpora/Disponibilização de Corpora, roughly "Access and Availability of Corpora"), allows a user to query around 40 million words of Portuguese text. After describing the aims of the service, which is still subject to regular improvements, we focus on the process of tagging and parsing the underlying corpora, using a Constraint Grammar parser for Portuguese.

Similar resources

Corpora at Linguateca: Vision and roads taken

In the late nineties, access to Portuguese data in electronic form was scarce, and was considered one of the bottlenecks limiting the advance of natural language processing of Portuguese (Santos, 1999a), so Linguateca’s launch of AC/DC aimed to significantly increase the amount of data – and its quality, in that the data was annotated and classified. To the best of my knowledge, A...

Experiments in Human-computer Cooperation for the Semantic Annotation of Portuguese Corpora

In this paper, we present a system, called corte-e-costura, to aid human annotation of semantic information within the AC/DC project. This system leverages the human annotation effort by providing the annotator with a simple system that applies rules incrementally. Our goal was twofold: first, to develop an easy-to-use system that required a minimum of learning from the part of the ...

Linguateca's infrastructure for Portuguese and how it allows the detailed study of language varieties

In this paper I briefly present Linguateca, an infrastructure project for Portuguese which is ten years old, and show how it provides several possibilities to study grammatical and semantic differences between varieties of the language. After a short history of Portuguese corpus linguistics, presenting the main projects in the area, I discuss in some detail the AC/DC project (Santos & Bi...

Providing On-line Access to Portuguese Language Resources: Corpora and Lexicons

Several Language Resources (LRs) for Portuguese, developed at the Center of Linguistics of the Lisbon University (CLUL), are available on-line at CLUL’s webpage: www.clul.ul.pt/english/sectores/projecto_rld.html. These LRs have been extracted from or developed based on the Reference Corpus of Contemporary Portuguese (CRPC), a monitor corpus containing, at present, more than 300 million word...

Floresta Sintá(c)tica: A treebank for Portuguese

Susana Afonso, Eckhard Bick, Renato Haber, Diana Santos

This paper reviews the first year of the creation of a publicly available treebank for Portuguese, Floresta Sintá(c)tica, a collaboration project between the VISL and the Computational Processing of Portuguese projects. After briefly describing the main goals and the organization of the project, the creation of the annotated objects is presented in detail: preparing the text to be anno...

Publication date: 2000